Abstract: The problem of opportunistic spectrum access in
cognitive radio networks has been recently formulated as a non- 
Bayesian restless multi-armed bandit problem. In this problem, 
there are N arms (corresponding to channels) and one player 
(corresponding to a secondary user). The state of each arm 
evolves as a finite-state Markov chain with unknown parameters. 
At each time slot, the player can select K < N arms to 
play and receives state-dependent rewards (corresponding to the 
throughput obtained given the activity of primary users). The 
objective is to maximize the expected total rewards (i.e., total 
throughput) obtained over multiple plays. The performance of 
an algorithm for such a multi-armed bandit problem is measured 
in terms of regret, defined as the difference in expected reward 
compared to a model-aware genie who always plays the best K 
arms. In this paper, we propose a new continuous exploration and 
exploitation (CEE) algorithm for this problem. When no infor- 
mation is available about the dynamics of the arms, CEE is the 
first algorithm to guarantee near-logarithmic regret uniformly 
over time. When some bounds corresponding to the stationary 
state distributions and the state-dependent rewards are known, 
we show that CEE can be easily modified to achieve logarithmic 
regret over time. In contrast, prior algorithms require additional 
information concerning bounds on the second eigenvalues of the 
transition matrices in order to guarantee logarithmic regret. 
Finally, we show through numerical simulations that CEE is 
more efficient than prior algorithms. 

I. Introduction 

Multi-arm bandit (MAB) problems are widely used to make 
optimal decisions in dynamic environments. In the classic 
MAB problem, there are N independent arms and one player. 
At every time slot, the player selects $K$ ($\ge 1$) arms to sense
and receives a certain amount of reward. In the classic non-
Bayesian formulation, the reward of each arm is i.i.d.
over time with a distribution unknown to the player. The player seeks to
design a policy which can maximize the expected total reward. 

One interesting variant of multi-armed bandits is the restless 
multi-arm bandit problem (RMAB). In this case, all the arms, 
whether selected (activated) or not, evolve as a Markov chain 
at every time slot. When one arm is played, its transition 
matrix may be different from that when it is not played. Even 
if the player knows the parameters of the model, which can be 
referred to as the Bayesian RMAB since the beliefs on each 
arm can be updated at each time based on the observations 
in this case, the design of the optimal policy turns out to be a
PSPACE-hard optimization problem [2].

In this paper, we consider the more challenging non- 
Bayesian RMAB problems, in which parameters of the model 



are unknown to the player. The objective is to minimize regret, 
defined as the gap between the expected reward that can be 
achieved by a suitably defined genie that knows the parameters 
and that obtained by the given policy. As stated before, finding
the optimal policy, which is in general non-stationary, is
PSPACE-hard even if the parameters are known. So we use
instead a weaker notion of regret, where the genie always 
selects the K most rewarding arms that have highest stationary 
rewards when activated. 

We propose a sample mean-based index policy without 
information about the system. We prove that this algorithm 
achieves regret arbitrarily close to logarithmic uniformly 
over the time horizon. Specifically, the regret can be bounded by
$Z_1 G(n) \ln n + Z_2 \ln n + Z_3 G(n) + Z_4$, where $n$ is time,
$Z_i$, $i = 1, 2, 3, 4$, are constants and $G(n)$ can be any divergent
non-decreasing sequence of positive integers. Since $G(n)$ can
grow arbitrarily slowly, the regret of our
algorithm is nearly logarithmic in time. The significance
of such a sub-linear time regret bound is that the time- 
averaged regret tends to zero (or possibly even negative since 
the genie we compare with is not using a globally optimal 
policy), implying the time-averaged rewards of the policy 
will approach or even possibly exceed those obtained by the 
stationary policy adopted by the model-aware genie. 

If some bounds corresponding to the stationary state
distributions and the state-dependent rewards are known, we show
that the algorithm can be easily modified to achieve
logarithmic regret over time. Compared to prior work [6], [7], [14],
our algorithm requires the least information about the system;
in particular, we do not need to know the second largest
eigenvalue of the transition matrix or of its multiplicative
symmetrization. Moreover, our simulation results show that our
algorithm obtains the lowest regret compared to previously
proposed algorithms when the parameters just satisfy the
theoretical bounds.

Research in restless multi-arm bandit problems has a lot 
of applications. For instance, it has been applied to dynamic 
spectrum sensing for opportunistic spectrum access in cogni- 
tive radio networks, where a secondary user must select K of 
N channels to sense at each time to maximize its expected 
reward from transmission opportunities. If the primary user 
occupancy on each channel is modeled as a Markov chain with 
unknown parameters, then we obtain an RMAB problem. We 
conduct our simulation-based evaluations in the context of this 
particular problem of opportunistic spectrum access.

The remainder of this paper is organized as follows. In
Section II we briefly review the related work on MAB
problems. In Section III we formulate the general RMAB
problem. In Sections IV and V we introduce a sample
mean based policy and provide a proof of the regret upper
bound, separately for the single and multiple channel selection
cases. In Section VI we evaluate our algorithm and compare it
via simulations with the RCA algorithm proposed in [14] and
the RUCB algorithm proposed in [6] for the problem of opportunistic
spectrum access. We conclude the paper in Section VII.

II. Related Work 

In 1985, Lai and Robbins proved that the minimum regret
grows logarithmically with time [12]. They also
proposed the first policy that achieved the optimal logarithmic 
regret for multi-armed bandit problems in which the rewards 
are i.i.d. over time. Their policy only achieves the optimal 
regret asymptotically. Anantharam et al. extended this result to 
multiple simultaneous arm plays, as well as single-parameter 
Markovian rested rewards [4]. Auer et al. developed the UCB1
policy in 2002, applying to i.i.d. reward distributions with 
finite support, achieving logarithmic regret over time, rather 
than only asymptotically in time. Their policy is based on the 
sample mean of the observed data, and has a rather simple 
index selection method. 

One important variant of classic multi-armed bandit problem 
is the Bayesian MAB. In this case, a priori probabilistic 
knowledge about the problem and system is required. Gittins 
and Jones presented a simple approach for the rested bandit 
problem, in which one arm is activated at each time and 
only the activated arm changes state as a known Markov 
process [8]. The optimal policy is to play the arm with the highest
Gittins index. The restless bandit problem, in which all the arms
can change state, was posed by Whittle in 1988 [9].
Finding the optimal solution for this problem has been shown to
be PSPACE-hard by Papadimitriou and Tsitsiklis [2]. Whittle
proposed an index policy which is optimal under certain
conditions [9]. This policy can offer near-optimal performance
numerically; however, its existence and optimality are not
guaranteed. The restless bandit problem has no general solution,
though it may be solved in special cases. For instance, when
each channel is modeled as an identical two-state Markov chain,
the myopic policy is proved to be optimal if the number of
channels is no more than 3 or the channels are positively correlated [10], [11].

There have been a few recent attempts to solve the restless
multi-arm bandit problem under unknown models. In [14],
Tekin and Liu use a weaker definition of regret and propose
a policy (RCA) that achieves logarithmic regret when certain
knowledge about the system is available. However, the algorithm
exploits only part of the observed data, which leaves room to
improve performance. In [6], Haoyang Liu et al. proposed
a policy, referred to as RUCB, that achieves logarithmic regret
over time when certain system parameters are known. The
regret they adopt is the same as in [14]. They also extend the
RUCB policy to achieve near-logarithmic regret over time
when no knowledge about the system is available. Conclusions
on multi-arm selection are given in [7]. However, they only
give the upper bound of regret at the end of certain time
points, referred to as epochs. When no a priori information about
the system is known, their analysis of regret gives the upper
bound over time only asymptotically, not uniformly.

In our previous work [5], we adopted a stronger definition
of regret, defined as the reward loss relative to the optimal
policy. Our policy achieves near-logarithmic regret without
prior knowledge of the system. It applies to special cases of the RMAB,
in particular the same scenario as in [10] and [11].

III. Problem Formulation 

We consider a time-slotted system with one player and 
N independent arms. At each time slot, the player selects 
(activates) K(< N) arms and gets a certain amount of rewards 
according to the current state of the arm. Each arm is modeled 
as a discrete-time, irreducible and aperiodic Markov chain 
with finite state space. We assume the arms are independent. 
Generally, the transition matrices in the activated model and 
the passive model are not necessarily identical. The player 
can only see the state of the sensed arm and does not know 
the transitions of the arms. The player aims to maximize its 
expected total reward (throughput) over some time horizon 
by judiciously choosing a sensing policy $\Phi$ that governs the
channel selection in each slot. Here, a policy is an algorithm 
that specifies arm selection based on observation history. 

Let $S^i$ denote the state space of arm $i$. Denote by $r^i_x$ the reward
obtained from state $x$ of arm $i$, $x \in S^i$. Without loss of
generality, we assume $r^i_x \le 1$, $\forall x \in S^i$, $\forall i$. Let $P_j$ denote the
active transition matrix of arm $j$ and $Q_j$ denote the passive
transition matrix. Let $\pi^i = \{\pi^i_x, x \in S^i\}$ denote the stationary
distribution of arm $i$ in the active model, where $\pi^i_x$ is the
stationary probability of arm $i$ being in state $x$ (under $P_i$).
The stationary mean reward of arm $i$, denoted by $\mu^i$, is the
expected reward of arm $i$ under its stationary distribution:

$$\mu^i = \sum_{x \in S^i} r^i_x \pi^i_x \quad (1)$$

Consider the permutation of $\{1, \dots, N\}$ denoted as $\sigma$, such
that $\mu^{\sigma(1)} > \mu^{\sigma(2)} > \dots > \mu^{\sigma(N)}$. We are interested in
designing policies that perform well with respect to regret,
which is defined as the difference between the expected reward
obtained by the policy that always selects the $K$ best arms and
that obtained by the given policy. The best arm is the one with the
highest stationary mean reward.
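The stationary mean reward and the ordering $\sigma$ can be illustrated numerically. The following is a minimal sketch, assuming the active transition matrix and the state rewards of each arm are known (as they are to the genie). The helper names `stationary_distribution` and `stationary_mean_reward` and the two example arms are illustrative assumptions, not part of the original formulation.

```python
def stationary_distribution(P, iters=500):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by power iteration (valid for irreducible, aperiodic chains)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi


def stationary_mean_reward(P, rewards):
    """mu = sum_x r_x * pi_x, the expected reward under the stationary distribution."""
    pi = stationary_distribution(P)
    return sum(r * p for r, p in zip(rewards, pi))


# Hypothetical two-state arms: state 0 pays 0.1, state 1 pays 1.
arms = [
    [[0.7, 0.3], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9]],
]
mus = [stationary_mean_reward(P, [0.1, 1.0]) for P in arms]
# sigma lists arm indices by decreasing stationary mean reward.
sigma = sorted(range(len(arms)), key=lambda i: -mus[i])
```

With $K = 1$, the genie in the regret definition would always play arm `sigma[0]`.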

Let $Y^{\Phi}(t)$ denote the reward obtained at time $t$ with policy $\Phi$.
The total reward achieved by policy $\Phi$ is given by

$$R^{\Phi}(t) = \sum_{j=1}^{t} Y^{\Phi}(j) \quad (2)$$

and the regret $r^{\Phi}(t)$ achieved by policy $\Phi$ is given by

$$r^{\Phi}(t) = t \sum_{j=1}^{K} \mu^{\sigma(j)} - E(R^{\Phi}(t)) \quad (3)$$
The objective is to minimize the growth rate of the regret.

IV. Analysis for Single Arm Selection 

In this section, we focus on the situation when $K = 1$. In
this case, the player selects one arm each time. We first show 
an algorithm called Continuous Exploration and Exploitation 
(CEE) and then prove that our algorithm achieves a near- 
logarithmic regret with time. 

A. The CEE Algorithm for non-Bayesian RMAB

Our CEE algorithm (see Algorithm 1) works as follows.
We first initialize by playing each arm for a certain number of
time slots (we call each such block of time slots a step), then iterate
the arm selection by finding the arm that maximizes the
index shown in line 8 of Algorithm 1 and playing this
arm for one step. A key issue is how long to play each
arm at each step. It turns out from the analysis we present in
the next subsection that it is desirable to slowly increase the
duration of each step using any (arbitrarily slowly) divergent
non-decreasing sequence of positive integers $\{B_i\}_{i=1}^{\infty}$.

A list of notations is summarized as follows:

• $n$: time.

• $B_i$: duration of the $i$-th step.

• $A_j(i_j)$: sample mean of the rewards in the $i_j$-th step in which
arm $j$ is selected.

• $X_j$: sum of the sample means over all the steps in which arm $j$ is
selected.

Algorithm 1 Continuous Exploration and Exploitation (CEE):
Single Arm Selection
1: // Initialization
2: Play arm $i$ for $B_i$ time slots, denote $A_i(1)$ as the sample
   mean of these $B_i$ rewards, $i = 1, 2, \dots, N$
3: $X_i = A_i(1)$, $i = 1, 2, \dots, N$
4: $n = \sum_{i=1}^{N} B_i$
5: $i = N + 1$, $i_j = 1$, $j = 1, 2, \dots, N$
6: // Main loop
7: while 1 do
8:   Find $j$ such that $j = \arg\max_j \frac{X_j}{i_j} + \sqrt{\frac{L \ln n}{i_j}}$ ($L$ can be
     any constant greater than 2)
9:   $i_j = i_j + 1$
10:  Play arm $j$ for $B_i$ slots, let $A_j(i_j)$ record the sample
     mean of these $B_i$ rewards
11:  $X_j = X_j + A_j(i_j)$
12:  $i = i + 1$
13:  $n = n + B_i$
14: end while
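The steps of Algorithm 1 can be sketched as a small simulation. This is a minimal illustration, assuming two-state channels whose active and passive transition matrices coincide; the function name `simulate_cee`, its parameters, and the example channel values are assumptions made for the sketch. The step counter is advanced before each step so that step $i$ lasts `B(i)` slots, which simplifies the bookkeeping of lines 12 and 13 and makes no difference for a constant sequence.

```python
import math
import random


def simulate_cee(p01, p10, rewards, B, L, horizon, seed=0):
    """Run single-arm CEE on two-state restless channels.

    p01[k], p10[k]: transition probabilities of channel k (active == passive).
    rewards: (reward in state 0, reward in state 1).
    B: function mapping the step number i (1-based) to the duration B_i.
    Returns the number of steps each arm was played.
    """
    rng = random.Random(seed)
    N = len(p01)
    state = [rng.randint(0, 1) for _ in range(N)]

    def play(arm, slots):
        # Every channel evolves in every slot (restless); only `arm` is observed.
        total = 0.0
        for _ in range(slots):
            total += rewards[state[arm]]
            for k in range(N):
                flip = p01[k] if state[k] == 0 else p10[k]
                if rng.random() < flip:
                    state[k] = 1 - state[k]
        return total / slots

    X = [0.0] * N    # X_j: sum of per-step sample means of arm j
    steps = [0] * N  # i_j: number of steps in which arm j was played
    n, i = 0, 0
    # Initialization: play every arm for one step.
    for j in range(N):
        i += 1
        X[j] += play(j, B(i))
        steps[j] += 1
        n += B(i)
    # Main loop: play the arm with the largest index for one step.
    while n < horizon:
        j = max(range(N),
                key=lambda a: X[a] / steps[a] + math.sqrt(L * math.log(n) / steps[a]))
        i += 1
        X[j] += play(j, B(i))
        steps[j] += 1
        n += B(i)
    return steps


steps = simulate_cee(
    p01=[0.3, 0.8, 0.5, 0.2, 0.1], p10=[0.9, 0.7, 0.1, 0.4, 0.5],
    rewards=(0.1, 1.0), B=lambda i: 49, L=2.1, horizon=100_000)
```

Because every step updates the sample mean of the played arm, all observations contribute to the index, which is the "continuous" aspect of the policy.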



B. Regret Analysis 

We first define the discrete function $G(n)$, which represents
the value of $B_i$ at the $n$-th time slot in Algorithm 1:

$$G(n) = \min_{I} B_I \quad \text{s.t.} \quad \sum_{i=1}^{I} B_i \ge n \quad (4)$$

Since $B_i \ge 1$, it is obvious that $G(n) \le B_n$, $\forall n$. Note that
since $B_i$ can be any arbitrarily slow non-decreasing diverging
sequence, $G(n)$ can also grow arbitrarily slowly.
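The definition of $G(n)$ can be made concrete with a short sketch. The function `G` below follows (4) directly; the example sequence `B_log` (with $B_i = 1 + \lfloor \log_2 i \rfloor$) is a hypothetical choice of a slowly diverging non-decreasing sequence, not one prescribed by the paper.

```python
from itertools import count


def G(n, B):
    """Return G(n) = B_I for the smallest I with B_1 + ... + B_I >= n,
    i.e. the duration of the step that contains time slot n.
    For a non-decreasing sequence this is the minimum in (4)."""
    total = 0
    for I in count(1):
        total += B(I)
        if total >= n:
            return B(I)


def B_log(i):
    """Hypothetical step durations B_i = 1 + floor(log2(i)) for i >= 1."""
    return i.bit_length()
```

For a constant sequence such as `lambda i: 49`, `G` returns 49 for every `n`, which is the setting of the corollaries.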

In this subsection, we show that the regret achieved by our
algorithm has a near-logarithmic order. This is given in the
following Theorem 1.

Theorem 1: Assume all arms are modeled as finite-state,
irreducible, aperiodic and reversible Markov chains. All the
states (rewards) are positive. The expected regret with
Algorithm 1 after $n$ time slots is at most $Z_1 G(n) \ln n + Z_2 \ln n +
Z_3 G(n) + Z_4$, where $Z_1, Z_2, Z_3, Z_4$ are constants only related
to $P_i$, $i = 1, 2, \dots, N$; explicit expressions are at the end of the
proof of Theorem 1.

The proof of Theorem 1 uses the following fact and two
lemmas that we present next.

Fact 1: (Chernoff-Hoeffding bound) Let $X_1, \dots, X_n$ be
random variables with common range $[0, 1]$ and such that
$E[X_t \mid X_1, \dots, X_{t-1}] = \mu$. Let $S_n = X_1 + \dots + X_n$. Then
for all $a \ge 0$,

$$P\{S_n \ge n\mu + a\} \le e^{-2a^2/n}; \quad P\{S_n \le n\mu - a\} \le e^{-2a^2/n} \quad (5)$$
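As a quick sanity check of Fact 1, the bound in (5) can be compared with a Monte Carlo estimate. The sketch below assumes i.i.d. Bernoulli variables, which is only one special case covered by the fact; the function names and the chosen values of $n$, $\mu$ and $a$ are illustrative.

```python
import math
import random


def hoeffding_upper_tail(n, a):
    """Right-hand side of (5): bound on P{S_n >= n*mu + a} for [0, 1]-valued variables."""
    return math.exp(-2.0 * a * a / n)


def empirical_upper_tail(n, mu, a, trials=5000, seed=1):
    """Monte Carlo estimate of P{S_n >= n*mu + a} for i.i.d. Bernoulli(mu) variables."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < mu)
        if s >= n * mu + a:
            hits += 1
    return hits / trials


bound = hoeffding_upper_tail(n=100, a=10)  # equals exp(-2)
estimate = empirical_upper_tail(n=100, mu=0.5, a=10)
```

The estimate should fall below the bound, since the Chernoff-Hoeffding inequality is not tight for this distribution.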



The first lemma is a non-trivial variant of the Chernoff-
Hoeffding bound, first introduced in our recent work [5],
that allows for bounded differences between the conditional
expectations of a sequence of random variables that are revealed
sequentially:

Lemma 1: Let $X_1, \dots, X_n$ be random variables with range
$[0, b]$ and such that $|E[X_t \mid X_1, \dots, X_{t-1}] - \mu| \le C$, where $C$ is a
constant such that $0 < C < \mu$. Let $S_n = X_1 + \dots +
X_n$. Then for all $a \ge 0$,

$$P\{S_n \ge n(\mu + C) + a\} \le e^{-2\left(\frac{a(\mu - C)}{b(\mu + C)}\right)^2 / n} \quad (6)$$

and

$$P\{S_n \le n(\mu - C) - a\} \le e^{-2(a/b)^2/n} \quad (7)$$



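Proof: We first prove (6). We generate random variables
$X'_1, X'_2, \dots, X'_n$ as follows:

$$X'_1 = (\mu + C)\frac{X_1}{E[X_1]},$$
$$X'_2 = (\mu + C)\frac{X_2}{E[X_2 \mid X_1]},$$
$$\vdots$$
$$X'_t = (\mu + C)\frac{X_t}{E[X_t \mid X_1, X_2, \dots, X_{t-1}]}.$$

Note that

$$|E[X_t \mid X_1, \dots, X_{t-1}] - \mu| \le C.$$

So we have

$$|E[X_t \mid X'_1, \dots, X'_{t-1}] - \mu| \le C.$$

Since $\frac{\mu + C}{E[X_t \mid X_1, \dots, X_{t-1}]}$ is at least 1 and at most $\frac{\mu + C}{\mu - C}$, $X'_1, X'_2, \dots, X'_n$
have finite support (they are in the range $[0, b\frac{\mu + C}{\mu - C}]$). Besides,
$E[X'_t \mid X'_1, \dots, X'_{t-1}] = \mu + C$, $\forall t$.

Let $S'_n = X'_1 + X'_2 + \dots + X'_n$, then for all $a \ge 0$,

$$P\{S_n \ge n(\mu + C) + a\} \le P\{S'_n \ge n(\mu + C) + a\} \le e^{-2\left(\frac{a(\mu - C)}{b(\mu + C)}\right)^2 / n} \quad (8)$$

The first inequality holds because $\frac{\mu + C}{E[X_t \mid X_1, \dots, X_{t-1}]} \ge 1$, $\forall t$. The second
inequality holds because of Fact 1.

The proof of (7) is similar. We generate random variables
$X'_1, X'_2, \dots, X'_n$ as follows:

$$X'_1 = (\mu - C)\frac{X_1}{E[X_1]},$$
$$\vdots$$
$$X'_t = (\mu - C)\frac{X_t}{E[X_t \mid X_1, X_2, \dots, X_{t-1}]}.$$

Note that

$$|E[X_t \mid X_1, \dots, X_{t-1}] - \mu| \le C.$$

So we have

$$|E[X_t \mid X'_1, \dots, X'_{t-1}] - \mu| \le C.$$

$\frac{\mu - C}{E[X_t \mid X_1, \dots, X_{t-1}]}$ is at most 1 and at least $\frac{\mu - C}{\mu + C}$, therefore $X'_1, X'_2, \dots, X'_n$
have finite support (they are in the range $[0, b]$). Besides,
$E[X'_t \mid X'_1, \dots, X'_{t-1}] = \mu - C$, $\forall t$.

Let $S'_n = X'_1 + X'_2 + \dots + X'_n$, then for all $a \ge 0$,

$$P\{S_n \le n(\mu - C) - a\} \le P\{S'_n \le n(\mu - C) - a\} \le e^{-2(a/b)^2/n} \quad (9)$$

The first inequality holds because $\frac{\mu - C}{E[X_t \mid X_1, \dots, X_{t-1}]} \le 1$, $\forall t$. The second
inequality holds because of Fact 1. ■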

Lemma 2: [4] Consider an irreducible, aperiodic Markov
chain with state space $S$, matrix of transition probabilities $P$,
an initial distribution $q$ which is positive in all states, and
stationary distribution $\pi$ ($\pi_s$ is the stationary probability of
state $s$). The state (reward) at time $t$ is denoted by $s(t)$. Let $\mu$
denote the mean reward. If we play the chain for an arbitrary
time $T$, then there exists a value $A_P \le (\min_{s \in S} \pi_s)^{-1} \sum_{s \in S} s$
such that $E[\sum_{t=1}^{T} s(t) - \mu T] \le A_P$.

Lemma 2 shows that if a player keeps selecting the optimal
arm, the difference between the expected reward and the
highest stationary reward is bounded by a constant. Hence
if the player switches from the optimal arm to another one,
the reward loss caused by switching can be bounded.

Based on these two lemmas, we can give the proof of
Theorem 1 as below.

Proof: Since $K = 1$, $\sigma(1)$ is the index of the optimal arm.
The regret comes from two parts: the regret when selecting an
arm other than arm $\sigma(1)$; and the difference between $\mu^{\sigma(1)}$ and
$E(Y^{\Phi}(t))$ when selecting arm $\sigma(1)$. From Lemma 2 we know
that each time we switch from arm $\sigma(1)$ to another one,
we lose at most a constant value from the second part of
the regret. If the number of selections of an arm other than
$\sigma(1)$ in line 8 is bounded by $O(\ln n)$, the first part of the regret
can be bounded by $O(G(n) \ln n)$ and the second part can be
bounded by $A_P O(\ln n)$, and the total regret can be bounded
by $O(G(n) \ln n)$. So next we will show this is true.

For ease of exposition, we discuss the time slots $n$ such
that $G \| n$, where $G \| n$ denotes that time $n$ is the end of a certain
step.

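We define $q$ as the smallest index such that

$$B_q \ge \left\lceil \max\left\{ \frac{2C_P}{\mu^{\sigma(1)} - \mu^{\sigma(2)}}, \frac{C_P}{\mu^{\sigma(l)}}, l = 1, 2, \dots, N \right\} \right\rceil \quad (10)$$

where

$$C_P = \max_{1 \le i \le N} \left\{ \left(\min_{s \in S^i} \pi^i_s\right)^{-1} \sum_{s \in S^i} s \right\}$$

Let

$$c_{t,s} = \sqrt{(L \ln t)/s}$$

$$w^* = q\left(\mu^{\sigma(1)} - \frac{C_P}{B_q}\right)$$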



and 



w 



V (i) +C P /B, 



Cp 
B a 



1 



(ID 



(12) 



Next we will show that it is possible to define $\alpha^*$ such that
if arm $\sigma(1)$ is selected for $s$ ($\ge \alpha^*$) steps, then

$$\exp(-2(w^* - s c_{t,s})^2/(s - q)) \le t^{-4} \quad (13)$$

In fact, when $s > \max\{q, \lceil w^*/(\sqrt{L} - \sqrt{2}) \rceil^2\}$, we have

$$\sqrt{Ls} - w^* > \sqrt{2(s - q)}$$

Consider

$$f(t) = \sqrt{Ls \ln t} - w^* - \sqrt{2(s - q)\ln t}, \quad \forall t \ge e$$

Since $f(t)$ is an increasing function and $f(e) > 0$, we have
$f(t) > 0$, $\forall t \ge e$,
i.e. $\sqrt{Ls \ln t} - w^* > \sqrt{2(s - q) \ln t}$. This is equivalent to
$\exp(-2(w^* - s c_{t,s})^2/(s - q)) < t^{-4}$.
Thus at least we can set

$$\alpha^* = 1 + \lceil \max\{q, [w^*/(\sqrt{L} - \sqrt{2})]^2\} \rceil \quad (14)$$

For a similar reason, we could define

$$\alpha^i = 1 + \lceil \max\{q, [w^i/(\sqrt{L} - \sqrt{2})]^2\} \rceil \quad (15)$$

such that if arm $\sigma(i)$ is selected for $s$ ($\ge \alpha^i$) steps,



exp( ; — ) <t 



s- q 

Moreover, we will show that there exists 



(16) 



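$$\gamma = \Big\lceil \max\Big\{ (N - 1)(4\alpha^* + 1) + \alpha^*, \ (N - 1)e^{4\alpha^*/L} + \alpha^*, \ \max_{2 \le i \le N} \{ (N - 1)(4\alpha^i + 1) + \alpha^i, \ (N - 1)e^{4\alpha^i/L} + \alpha^i \} \Big\} \Big\rceil \quad (17)$$

such that for the time $n$, if $G(n) > B_{\gamma}$, then arm $\sigma(1)$ is
selected at least $\alpha^*$ times and arm $\sigma(i)$ is selected at least $\alpha^i$
times.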
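The choice of $\alpha^*$ in (14) can be checked numerically against the inequality (13). The sketch below is a hedged illustration: it assumes a constant step duration of 49 (so that $q = 1$), $C_P = 6.6$, $\mu^{\sigma(1)} = 0.85$ and $L = 2.1$, which correspond to the simulation scenario used later in the paper; the function names are hypothetical.

```python
import math


def alpha_star(q, w_star, L):
    """alpha* = 1 + ceil(max{q, [w*/(sqrt(L) - sqrt(2))]^2}) as in (14)."""
    return 1 + math.ceil(max(q, (w_star / (math.sqrt(L) - math.sqrt(2))) ** 2))


def lhs_13(w_star, s, q, L, t):
    """Left-hand side of (13): exp(-2 (w* - s c_{t,s})^2 / (s - q))."""
    c_ts = math.sqrt(L * math.log(t) / s)
    return math.exp(-2.0 * (w_star - s * c_ts) ** 2 / (s - q))


# Illustrative values (assumptions): B_i = 49 for all i, hence q = 1.
q, L = 1, 2.1
w_star = q * (0.85 - 6.6 / 49)  # w* = q (mu^{sigma(1)} - C_P / B_q)
a_star = alpha_star(q, w_star, L)
```

Because $\sqrt{L} - \sqrt{2}$ is small when $L$ is close to 2, $\alpha^*$ (and with it $\gamma$) grows quickly as $L$ approaches 2.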

In fact, if arm $\sigma(1)$ has been selected less than $\alpha^*$ times,
consider the arm $j$ that has been selected for the most steps. Consider the
last time arm $j$ is selected, and denote that time as $t$. Then there must be

$$\frac{X_{\sigma(1)}}{i_{\sigma(1)}} + c_{t, i_{\sigma(1)}} < \frac{X_j}{i_j} + c_{t, i_j}$$
Since arm j has been selected the most times, we have Note that X a n\ Sl = ^(i),i + Ar(i),2 + • • ■ + ^U(i).si, 
ij > max{4a* + l,e 4a * /L }. Noting that > 0, < 1, where A a ^) A is sample average reward for the i th step 

V(i) < a* — 1, ij > 4a* + 1, we have " (1> selecting arm cr(l). From Lemma 12 we have 



Lint Lint r< n 

+ \l - < 1 + a/ . — - ,^(i) _ i^P wri. .1 ^ „<Ki) _,_ ^£ 



U + V 

Consider 



4a * + 1 ~ TT < E[i M ] < + ^>q (22) 



4a* + 1 V a* - 1 



Then applying Lemma Q] and the results in ( [T3T > and dT6b , 



, Llni / Lint W e have: 

gyp) = i + 



Since #(£) is a decreasing function and t > Yli=i &i — Sl q 

e4ov '™ ha ™ ^ =p ( i ^' + --- +i °''>>-- <^>-^- a ,,) 

V 4a* + 1 V a* - 1 < + ■ ■ ■ + + Ax(i), g +i + ; ; ■ + Ax(i), Sl < a{X) 

This contradicts the conclusion above. So arm cr(l) has been si 
played at least a* times. _ Cp _ c \ 

If we replace a* with a 1 and replace arm er(l) with arm _B g ,Sl 

cr{i), without changing the proof, we can conclude that arm < exp(— 2(w* — set Sl ) 2 /(si — ?)) < t -4 
cr(i) has been played at least a* times. (23) 

Next we will bound the number of times we fail to choose 
the optimal arm. We will show that this number has a f n a-(i) r> /d 

logarithmic order. P(±^ > f<J) + ^ + ^ + g P ^ c t ,, J ) 

Denote Tj(n) as the number of times we select arm a(j) i q " p ' 9 

up to time n. Then, for any positive integer I, we have A a ^ t i + • • ■ + A a ^ >s . Cp 

n • — f l ' 

t=J2iLi Bi,c\\t °W > + _ Cp/Bq c ^) 

< + *,<,} < P( 1 + --- + 1 + i ^ +1+i ^ > M *Cfl + ^ 



(^^<^ (1) -^-c Ml ) 



c t , Sj ) 



(25) 



„ Q(t),t=Bi+-+B Q(t) /3(t),i=Bi+-+-B 3(t) n°ti) - C P /B k 

E E £ -2(t^+ sc ^) 2 4 

t=B 1 H h-B-r,G||t si=a* Sj =max(ai,i) S exp^ — — ) S t 

I{^^+c Ml <^ + Ct , S3 .} ' (24) 

Sl S J Denote A,(n) as 

(18) n 7 

where I{a:} is the index function defined to be 1 when the Xj{n) = \(L(1 + , q ) 2 lnn)/(^' T(1) - 

predicate a; is true, and when it is a false predicate; i a {j) {t) " p ' 9 

is the number of times we select arm a(j) when up to time _ ) 2 ~\ 

t, Vj = 2, • •• , TV; X .( J -)(t) is the sum of every sample mean -B<j 
of arm <r(j) for i a ^(t) plays up to time t; X a ^ iS . is the 

sum of every sample mean for Sj times selecting arm <r(j). For I > Xj(n), dHJ is false. So we get: 

The condition {^Slh^ + Cf si < + Ct s J implies E(T,(n)) < A,(n) + 7 + Eg^^^E* . =1 2t-* 

that at least one of the following must hold: 2 ^26) 

Y n < +7+ — . 

^Wzfi < ^(D _ ^ _ ^ (19 ) 3 

si B q As we analysis before, the first part of the regret is bounded 

„.m cr^i^ifii ^ + ^/^ v (2i) 

^ B g 5g ^ - Cp/S/ * ,Si 1 ; and the second part is bounded by C P ^ 2 E ( T j( n )- 



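Therefore, we have:

$$r^{\Phi}(n) \le G(n) + \sum_{j=2}^{N} \left( G(n)(\mu^{\sigma(1)} - \mu^{\sigma(j)}) + 3C_P \right)\left( \lambda_j(n) + \gamma + \frac{\pi^2}{3} \right) \quad (27)$$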



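This inequality can be readily translated to the simplified
form of the bound given in the statement of Theorem 1, where:

$$Z_1 = \sum_{j=2}^{N} (\mu^{\sigma(1)} - \mu^{\sigma(j)}) \frac{L \left(1 + \frac{\mu^{\sigma(j)} + C_P/B_q}{\mu^{\sigma(j)} - C_P/B_q}\right)^2}{(\mu^{\sigma(1)} - \mu^{\sigma(j)} - 2C_P/B_q)^2}$$

$$Z_2 = 3C_P \sum_{j=2}^{N} \frac{L \left(1 + \frac{\mu^{\sigma(j)} + C_P/B_q}{\mu^{\sigma(j)} - C_P/B_q}\right)^2}{(\mu^{\sigma(1)} - \mu^{\sigma(j)} - 2C_P/B_q)^2}$$

$$Z_3 = \left(\gamma + \frac{\pi^2}{3}\right) \sum_{j=2}^{N} (\mu^{\sigma(1)} - \mu^{\sigma(j)}) + 1$$

$$Z_4 = 3(N - 1)C_P\left(\gamma + \frac{\pi^2}{3}\right)$$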



C. Corollary 

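From the analysis above, we see that if the sequence
$\{B_i\}_{i=1}^{\infty}$ is constant and no less than
$\lceil \max\{ \frac{2C_P}{\mu^{\sigma(1)} - \mu^{\sigma(2)}}, \frac{C_P}{\mu^{\sigma(l)}}, l = 1, 2, \dots, N \} \rceil$,
then Algorithm 1 achieves logarithmic regret
over time. Specifically, we have the following corollary:

Corollary 1: The system model is the same as that in
Theorem 1. In Algorithm 1, if

$$B_i = \left\lceil \max\left\{ \frac{2C_P}{\mu^{\sigma(1)} - \mu^{\sigma(2)}}, \frac{C_P}{\mu^{\sigma(l)}}, l = 1, 2, \dots, N \right\} \right\rceil, \quad \forall i \in \mathbb{N}$$

then the expected regret after $n$ time slots is at most
$Z'_1 B_1 \ln n + Z'_2 \ln n + Z'_3 B_1 + Z'_4$, where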



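$$Z'_1 = \sum_{j=2}^{N} (\mu^{\sigma(1)} - \mu^{\sigma(j)}) \frac{L \left(1 + \frac{\mu^{\sigma(j)} + C_P/B_1}{\mu^{\sigma(j)} - C_P/B_1}\right)^2}{(\mu^{\sigma(1)} - \mu^{\sigma(j)} - 2C_P/B_1)^2}$$

$$Z'_2 = 3C_P \sum_{j=2}^{N} \frac{L \left(1 + \frac{\mu^{\sigma(j)} + C_P/B_1}{\mu^{\sigma(j)} - C_P/B_1}\right)^2}{(\mu^{\sigma(1)} - \mu^{\sigma(j)} - 2C_P/B_1)^2}$$

$$Z'_3 = \left(\gamma_1 + \frac{\pi^2}{3}\right) \sum_{j=2}^{N} (\mu^{\sigma(1)} - \mu^{\sigma(j)}) + 1$$

$$Z'_4 = 3(N - 1)C_P\left(\gamma_1 + \frac{\pi^2}{3}\right)$$

and here $\gamma_1$ is obtained given $q = 1$ in (11), (14), (15), (17)
and (12).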
Remark: This corollary is just a special case of Theorem
1, but it reveals the fact that when certain knowledge of the
system is available (in this case, some bounds related to the 
stationary state distribution and state-dependent rewards), we 
can design an algorithm that achieves logarithmic regret over 
time. 



V. Analysis for Multi-Arm Selection 

In this section, we discuss the general case where K is a 
known positive integer. We show a generalization of the CEE 
algorithm and prove that it still achieves a near-logarithmic 
regret with time. 

A. Algorithm Design 

The basic idea is similar to Algorithm 1: first initialize and
then find the optimal indices. The only difference is that here we
have to select the $K$ arms that obtain the greatest values of the index in line
8 at one time. The definition of $\{B_i\}_{i=1}^{\infty}$ stays the same and
the details are shown in Algorithm 2.

Algorithm 2 Continuous Exploration and Exploitation (CEE):
Multi-Arm Selection
1: // Initialization
2: Sequentially play $K$ arms for $B_i$ time slots until every arm is
   selected once, $i = 1, 2, \dots, \lceil N/K \rceil$. Denote $A_j$ as the
   sample mean of the corresponding $B_i$ rewards of arm $j$,
   $i = 1, 2, \dots, \lceil N/K \rceil$, $j = 1, 2, \dots, N$
3: $X_j = A_j$, $j = 1, 2, \dots, N$
4: $n = \sum_{i=1}^{\lceil N/K \rceil} B_i$
5: $i = \lceil N/K \rceil + 1$, $i_j = 1$, $j = 1, 2, \dots, N$
6: // Main loop
7: while 1 do
8:   Denote $F(j) = \frac{X_j}{i_j} + \sqrt{\frac{L \ln n}{i_j}}$ ($L$ can be any constant
     larger than 2)
9:   Find arms $j_1, j_2, \dots, j_K$ such that
     $F(j_1) \ge F(j_2) \ge \dots \ge F(j_K) \ge F(l)$,
     $\forall l \notin \{j_1, j_2, \dots, j_K\}$
10:  $i_{j_l} = i_{j_l} + 1$, $1 \le l \le K$
11:  Select arms $j_1, j_2, \dots, j_K$ and play for $B_i$ time slots, let
     $A_{j_l}(i_{j_l})$ record the sample mean of these $B_i$ rewards
12:  $X_{j_l} = X_{j_l} + A_{j_l}(i_{j_l})$, $1 \le l \le K$
13:  $i = i + 1$
14:  $n = n + B_i$
15: end while
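The selection step of Algorithm 2 (lines 8 and 9) amounts to ranking the arms by their index and keeping the top $K$. The following is a minimal sketch of that step only; the function name `select_top_k` and the bookkeeping values in the example are hypothetical.

```python
import math


def select_top_k(X, steps, n, K, L=2.1):
    """Return the K arms with the largest index F(j) = X_j/i_j + sqrt(L ln n / i_j)."""
    def F(j):
        return X[j] / steps[j] + math.sqrt(L * math.log(n) / steps[j])
    return sorted(range(len(X)), key=F, reverse=True)[:K]


# Hypothetical bookkeeping: X[j] is the sum of per-step sample means of arm j,
# steps[j] is the number of steps in which arm j was played, n is the current time.
chosen = select_top_k(X=[3.0, 8.5, 2.0, 0.4], steps=[10, 10, 5, 1], n=1000, K=2)
```

In this example the two least-played arms are chosen even though arm 1 has the highest sample mean, because the exploration term dominates when an arm has been played only a few times.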



B. Regret Analysis 

In this subsection, we keep the definition of $G(n)$ in (4) and
the definition of regret in (3). We will show that the regret
achieved by Algorithm 2 has a near-logarithmic order. This is
given in the following Theorem 2.

Theorem 2: Assume all arms are modeled as finite-state,
irreducible, aperiodic and reversible Markov chains. All the
states (rewards) are positive. The expected regret with
Algorithm 2 after $n$ time steps is at most $Z_5 G(n) \ln n + Z_6 \ln n +
Z_7 G(n) + Z_8$, where $Z_5, Z_6, Z_7, Z_8$ are constants only related
to $P_i$, $i = 1, 2, \dots, N$; explicit expressions are at the end of the
proof of Theorem 2.

Proof: The proof of Theorem 2 is similar to that of
Theorem 1. We still divide the regret into two parts and bound
them separately. We keep the notation $G \| n$ and discuss
the time slots $n$ such that $G \| n$.

We define $q'$ as the smallest index such that

$$B_{q'} \ge \left\lceil \max\left\{ \frac{2C_P}{\mu^{\sigma(K)} - \mu^{\sigma(K+1)}}, \frac{C_P}{\mu^{\sigma(l)}}, l = 1, 2, \dots, N \right\} \right\rceil \quad (28)$$

Let
(29) 



and 

(30) 

As shown in the proof of Theorem Q] if we set 

/3* = 1 + V2)] 2 }1, 1 < j < K (31) 

= l+[max-{y, [mV(VS-V^)] 2 }l,iir+l <i<N (32) 
and if s > /3* and s > (3 l we will have 



-2(m* - sc t s ) 2 . 
ex P ( 1 J , t - S ' ) < r 4 . 



and 



exp(- 



S — q 



-2(m j + sct r 



< t~ 



s — q' 

Moreover, we will show that there exists 



(33) 



(34) 



V = [max( i max^{(iV - l)(5/3* + 1) + /3*, (N - l)(e 4 ^ 
+ /?*) + ^max^fJV - l)(5/3* + 1) + /3 l , (N- 
l)(e 4 ^/£ +/3 *) +/3 *})l 



4/3*/i 



(35) 

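such that for the time $n$, if $G(n) > B_{\gamma'}$, then arm $\sigma(j)$ is
played at least $\beta^*$ times and arm $\sigma(i)$ is played at least $\beta^i$
times, where $1 \le j \le K$, $K + 1 \le i \le N$.

In fact, if arm $\sigma(j)$ has been played less than $\beta^*$ times,
then there exists an arm $\sigma(l)$ ($K + 1 \le l \le N$) that has been
played the most times. Consider the last time that arm $\sigma(l)$ is
selected and arm $\sigma(j)$ is not selected, and denote that time as
$t$. Then it must be true that

$$\frac{X_{\sigma(j)}}{i_{\sigma(j)}} + c_{t, i_{\sigma(j)}} < \frac{X_{\sigma(l)}}{i_{\sigma(l)}} + c_{t, i_{\sigma(l)}}$$

Since arm $\sigma(l)$ has been played the most times, we have
$i_{\sigma(l)} \ge \max\{4\beta^* + 1, e^{4\beta^*/L}\}$. Noting that $\frac{X_{\sigma(j)}}{i_{\sigma(j)}} \ge 0$, $\frac{X_{\sigma(l)}}{i_{\sigma(l)}} \le 1$,
$i_{\sigma(j)} \le \beta^* - 1$, $i_{\sigma(l)} \ge 4\beta^* + 1$, we have

$$\sqrt{\frac{L \ln t}{\beta^* - 1}} < 1 + \sqrt{\frac{L \ln t}{4\beta^* + 1}}$$

Consider

$$g^*(t) = 1 + \sqrt{\frac{L \ln t}{4\beta^* + 1}} - \sqrt{\frac{L \ln t}{\beta^* - 1}}$$

Since $g^*(t)$ is a decreasing function and $t \ge \sum_{i} B_i \ge e^{4\beta^*/L}$, we have

$$g^*(t) \le g^*(e^{4\beta^*/L}) = 1 + \sqrt{\frac{4\beta^*}{4\beta^* + 1}} - \sqrt{\frac{4\beta^*}{\beta^* - 1}} < 0$$

This contradicts the conclusion above. So arm $\sigma(j)$ has been
played at least $\beta^*$ times.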

If we replace $\beta^*$ with $\beta^i$ and replace arm $\sigma(j)$ with arm
$\sigma(i)$, without changing the proof, we can conclude that arm
$\sigma(i)$ has been played at least $\beta^i$ times, $K + 1 \le i \le N$.

Based on the conclusions above, we can bound the expectation
of the number of non-optimal arm choices. We keep the
notation $T_j(n)$ and $I\{x\}$ except that here $K + 1 \le j \le N$.
Every time we select $\sigma(j)$, there must exist an arm from $\sigma(1)$
to $\sigma(K)$ not being chosen. We denote that unknown arm as
$\sigma(\tau, t)$ (if more than one arm is not chosen, pick any of them).

Tjin) = 1 + 2^ H ■ + c ^ 



t=T,?=iBi,G\\t 



V(r,t)(*) 



< 



X °(j) 
ia(j){t) 



(36) 



+ Ct,i 3 } 



And if we replace $\sigma(1)$ with $\sigma(\tau, t)$, according to the
deduction from (19) to (26), we conclude that

$$E(T_j(n)) \le 1 + \max_{1 \le \tau \le K} \left( \lambda_{\tau,j}(n) + \gamma' + \frac{\pi^2}{3} \right) = 1 + \lambda_{K,j}(n) + \gamma' + \frac{\pi^2}{3} \quad (37)$$



where 



^0) - Cp/S, 



2C p . 2 



Therefore, we have: 



r*(n) < KG(n) + ^ (G(n)(^ <T(1) - 

7T 2 



(38) 



3Cp)(A^(n)+ 7 ' + — ) 



Equivalently, we have the simplified form of the bound 
given in the statement of Theorem 2, where: 



-Cp/B q .' 



N . ^+C P /B q , 2 

v ^ r V *"<.}■> -Cp/B„, > 



2 w 

^ = (7' + y) E (/x CT(K) -M CT(i) )+^ 



Z 8 = 3(/V-A)Cp( 7 ' + y) 



s 



C. Corollary 

Similarly to Section IV, when bounds on the stationary distributions and
rewards are available, $B_i$ in Algorithm 2 can be a constant
sequence. In this way, Algorithm 2 achieves logarithmic
regret over time. Specifically, we have Corollary 2 as
follows:

Corollary 2: The system model is the same as that in
Theorem 2. In Algorithm 2, if

$$B_i = \left\lceil \max\left\{ \frac{2C_P}{\mu^{\sigma(K)} - \mu^{\sigma(K+1)}}, \frac{C_P}{\mu^{\sigma(l)}}, l = 1, 2, \dots, N \right\} \right\rceil, \quad \forall i \in \mathbb{N}$$

then the expected regret after $n$ time slots is at most
$Z'_5 B_1 \ln n + Z'_6 \ln n + Z'_7 B_1 + Z'_8$, where



j=K+l 
^(i) _ ^ )2] 



2 w 

^ 7 = (72 + y) £ (M^-^J+A- 
4 = 3(iV-if)Cp( 72 + y) 

and here $\gamma_2$ is obtained given $q' = 1$ in (29), (31), (32), (35)
and (30).

VI. Numerical Results 

In this section, we simulate our algorithm and compare it 
with two previously proposed policies for this problem in the 
context of opportunistic spectrum access: (1) RCA proposed
by Cem Tekin et al. [14] and (2) RUCB proposed by H. Liu
et al. [6], [7]. We focus on two properties of the algorithms:
regret and variance, which show the efficiency and stability of 
the algorithms respectively. 

A. Channel Model and Parameters 

The arms are channels. The channel model is the commonly
used Gilbert-Elliott model. The state of each channel evolves
as an irreducible, aperiodic Markov chain. Each channel has
two states, good and bad. We consider $N = 5$ channels. At
each time slot, the player activates 1 channel (i.e. $K = 1$). The
active and passive transition matrices for each channel are the
same, i.e. $P_j = Q_j$, $1 \le j \le N$. For ease of comparison,
we set the non-decreasing sequence $\{B_i\}_{i=1}^{\infty}$ in Algorithm 1
to be a constant sequence.

We simulate the three algorithms under scenario S. The transition
probabilities and rewards for this scenario are shown in
Table I.

TABLE I
Transition Probabilities and Rewards for Scenario S

S       p01, p10    r0, r1
ch.1    0.3, 0.9    0.1, 1
ch.2    0.8, 0.7    0.1, 1
ch.3    0.5, 0.1    0.1, 1
ch.4    0.2, 0.4    0.1, 1
ch.5    0.1, 0.5    0.1, 1

Intuitively, in RCA and RUCB, the regret grows with
$L$. In our algorithm, the regret grows with both $L$ and
$B_i$. For fairness of comparison, we set these parameters
for all three algorithms to be just passing the theoretical
bound. In RCA [14], the regret has a logarithmic order for
$L \ge 112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$, where $S_{\max} = \max_{1 \le i \le N} |S^i|$,
$r_{\max} = \max_{x \in S^i, 1 \le i \le N} r^i_x$, $\hat{\pi}_{\max} = \max_{x \in S^i, 1 \le i \le N} \{\pi^i_x, 1 - \pi^i_x\}$,
$\epsilon_{\min} = \min_{1 \le i \le N} \epsilon^i$ and $\epsilon^i$ is the eigenvalue gap
of the multiplicative symmetrization of the transition probability
matrix of the $i$-th arm. In the scenario we set,
$112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$
is 414.8148. We set $L = 415$ in RCA.
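In the CEE algorithm, we prove that if $B_i$ meets the requirement 
stated in (10) and $L > 2$, the regret has a logarithmic upper 
bound over time. In scenario S, the lower bound in (10) is 
48.89. We therefore set $L = 2.1$ and $B_i = 49$. In the RUCB 
algorithm [6], it is required that 

$$L \ge \frac{1}{\epsilon_{\min}} \left( 4 \cdot \frac{20\, r_{\max}^2\, S_{\max}^2}{3 - 2\sqrt{2}} + 10\, r_{\max}^2 \right) \quad \text{and} \quad D \ge \frac{4L}{(\mu_{\sigma(K)} - \mu_{\sigma(K+1)})^2},$$ 

with $\epsilon_{\min}$ and the ordered mean rewards $\mu_{\sigma(k)}$ as defined in [6]. 
The lower bounds are 3125.2 and 171480, and we accordingly 
set $L = 3126$ and $D = 171520$ in RUCB. 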
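
The parameter thresholds quoted above can be checked numerically from Table I. The sketch below is an illustration only. It assumes the RCA condition has the form $112\, S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2 / \epsilon_{\min}$ with $\epsilon$ the eigenvalue gap of the multiplicative symmetrization. It also assumes the RUCB conditions have the forms $L \ge (4 \cdot 20\, r_{\max}^2 S_{\max}^2 / (3 - 2\sqrt{2}) + 10\, r_{\max}^2) / \epsilon_{\min}$ and $D \ge 4L / (\mu_{\sigma(K)} - \mu_{\sigma(K+1)})^2$, with $\epsilon$ taken as one minus the second eigenvalue. These forms, the state convention (0 = bad, 1 = good), and all function names are assumptions, chosen because they reproduce the reported values up to rounding.

```python
import math

# (p01, p10) for ch.1 to ch.5 and the two rewards, as listed in Table I.
CHANNELS = [(0.3, 0.9), (0.8, 0.7), (0.5, 0.1), (0.2, 0.4), (0.1, 0.5)]
R_BAD, R_GOOD = 0.1, 1.0   # assumed mapping: state 0 = bad, state 1 = good
S_MAX = 2                  # every channel has two states
R_MAX = max(R_BAD, R_GOOD)

def second_eigenvalue(p01, p10):
    """Second eigenvalue of the 2x2 transition matrix."""
    return 1.0 - p01 - p10

def mean_reward(p01, p10):
    """Stationary expected reward of one channel."""
    pi_good = p01 / (p01 + p10)
    return R_BAD * (1.0 - pi_good) + R_GOOD * pi_good

def rca_bound():
    """Assumed RCA threshold: 112 S_max^2 r_max^2 pi_hat_max^2 / eps_min."""
    pi_hat_max = max(max(p01, p10) / (p01 + p10) for p01, p10 in CHANNELS)
    # A two-state chain is reversible, so its multiplicative symmetrization
    # is P^2 and the eigenvalue gap is 1 - lambda_2^2.
    eps_min = min(1.0 - second_eigenvalue(*c) ** 2 for c in CHANNELS)
    return 112 * S_MAX**2 * R_MAX**2 * pi_hat_max**2 / eps_min

def rucb_bounds(k=1):
    """Assumed RUCB thresholds (L, D); eps is taken as 1 - lambda_2."""
    eps_min = min(1.0 - second_eigenvalue(*c) for c in CHANNELS)
    l_bound = (4 * 20 * R_MAX**2 * S_MAX**2 / (3 - 2 * math.sqrt(2))
               + 10 * R_MAX**2) / eps_min
    means = sorted((mean_reward(*c) for c in CHANNELS), reverse=True)
    d_bound = 4 * l_bound / (means[k - 1] - means[k]) ** 2
    return l_bound, d_bound

print(round(rca_bound(), 4))
print(rucb_bounds())
```

Under these assumptions the RCA threshold evaluates to about 414.81, and the RUCB thresholds to about 3125.2 and 1.7148e5, in line with the values used in the simulations.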

We simulate RCA, CEE and RUCB over 10 runs to calculate 
the regret. The time horizon is 100 million time slots. We also 
show the regret over the first 8 million time slots to compare 
the convergence speed of RCA and CEE. To assess the stability 
of each algorithm, we also present the variance of the rewards 
over 100 runs for RCA, CEE and RUCB. 
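
The evaluation procedure can be sketched as follows. This is a minimal harness under stated assumptions, not the code used for the reported results: the policy shown is a placeholder that picks a channel uniformly at random (it is not CEE, RCA or RUCB), the horizon is far shorter than 100 million slots, and the names `run_once`, `evaluate` and `random_policy` are hypothetical. Regret is estimated as the genie's expected reward (always playing the channel with the largest stationary mean, since $K = 1$) minus the average total reward over the runs.

```python
import random
import statistics

# (p01, p10) per channel and the rewards, as in Table I.
CHANNELS = [(0.3, 0.9), (0.8, 0.7), (0.5, 0.1), (0.2, 0.4), (0.1, 0.5)]
REWARD = (0.1, 1.0)  # assumed mapping: state 0 = bad, state 1 = good

def mean_reward(p01, p10):
    """Stationary expected reward of one channel."""
    pi_good = p01 / (p01 + p10)
    return REWARD[0] * (1 - pi_good) + REWARD[1] * pi_good

def run_once(policy, horizon, seed):
    """Total reward collected by `policy` in one run with K = 1."""
    rng = random.Random(seed)
    states = [0] * len(CHANNELS)
    total = 0.0
    for t in range(horizon):
        arm = policy(t, rng)
        total += REWARD[states[arm]]
        # Restless arms: every channel moves in every slot (P_j = Q_j).
        for j, (p01, p10) in enumerate(CHANNELS):
            leave = p01 if states[j] == 0 else p10
            if rng.random() < leave:
                states[j] = 1 - states[j]
    return total

def evaluate(policy, horizon, runs):
    """Return (estimated regret, variance of total reward across runs)."""
    totals = [run_once(policy, horizon, seed) for seed in range(runs)]
    genie = horizon * max(mean_reward(*c) for c in CHANNELS)
    return genie - statistics.mean(totals), statistics.variance(totals)

def random_policy(t, rng):
    """Placeholder policy: pick a channel uniformly at random."""
    return rng.randrange(len(CHANNELS))

regret, reward_variance = evaluate(random_policy, horizon=5000, runs=10)
```

Replacing `random_policy` with an implementation of a learning policy, and increasing `horizon` and `runs`, would give the two quantities compared in this section: regret and reward variance.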

The regret performance of all three algorithms is shown 
in Figures 1(a) and 1(b). The reward variance of all 
three algorithms is shown in Figure 1(c). 

B. Discussion 

First of all, we note from the figures that CEE shows 
substantially better regret performance than both RCA and 
RUCB. This is because in CEE, the selection of an arm depends 
on the entire observation history, i.e., we exploit the data 
observed in every time slot. In RCA, however, the player chooses 
the arm based only on data from the second part of each block 
(sub-block 2, SB2). CEE therefore uses data much more 
efficiently, and its sample means are much closer to their 
expectations. As for RUCB, in each exploration epoch the player 
plays every arm a prescribed number of times, which greatly 
reduces the chances of playing the optimal arm. This comparison 
also shows the advantage of continuous exploration and 
exploitation, which greatly cuts down the cost of observing 
and exploring. 

The second observation is that regret / ln(time) converges 
much more quickly in CEE than in RCA and RUCB. One 
reason is that the regret in RCA is much greater than in 
Algorithm 1, so it needs more time to reach the stationary point. 
Besides, as stated before, RCA exploits data less efficiently: its 
sample means are based on only part of the observation history, 
so they converge to the expected values much more slowly. As 
for RUCB, the parameter $D$ is considerably large, and it takes 
quite a long time for the length of the exploration epochs to grow 
so that an exploitation epoch can appear. RUCB converges the 
slowest of the three algorithms. 

Lastly, we see that the performance of RCA is much more 
random than that of CEE and RUCB: the reward variance 
of RCA is much higher than that of CEE and RUCB. The reason is 
that the number of time slots between two selections in RCA 
is a random variable. The player stays on the same arm until a 
pre-specified state is observed, so the length of each block 
may vary a lot from case to case. In CEE, however, the step 
length is a constant, which greatly reduces the randomness. 
In RUCB, the length of each epoch is also deterministic. 
Besides, RUCB makes far fewer choices than CEE 
and RCA. For these two reasons, RUCB also maintains high 
stability, albeit with poor regret performance. 
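
The effect of a random block length can be made concrete with a simplified case, given here as an illustration rather than as the block structure of [14] itself. Suppose the player waits on a two-state channel that is currently in state 0 until state 1 is first observed. The waiting time $T$ is then geometric with parameter $p_{01}$:

```latex
\Pr(T = t) = (1 - p_{01})^{t-1}\, p_{01}, \qquad
\mathbb{E}[T] = \frac{1}{p_{01}}, \qquad
\operatorname{Var}(T) = \frac{1 - p_{01}}{p_{01}^{2}}.
```

For ch.5 in Table I ($p_{01} = 0.1$), this gives a mean wait of 10 slots with a variance of 90, whereas a fixed step length contributes no variance at all.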

In conclusion, CEE outperforms RCA and RUCB in two 
aspects: regret and convergence speed. The reward variances 
of RUCB and CEE are nearly the same, and much lower than that 
of RCA. Finally, we should note that because the bound on 
parameter $B_i$ in (10) is much smaller than the bounds on 
parameter $L$ in RCA and on $L$ and $D$ in RUCB, if we modify RCA 
and RUCB to make them non-Bayesian algorithms, our algorithm 
will converge much faster. 

VII. Conclusion 

In this paper, we have considered the non-Bayesian restless 
multi-armed bandit problem, which has been shown to be of 
fundamental significance for opportunistic spectrum access in 
cognitive radio networks. We use a weak notion of regret, 
defined as the gap in expected reward compared to a genie 
who always plays the K best arms. We propose an algorithm 
that achieves near-logarithmic regret over time when no 
a priori information about the system is available. We also 
present another policy that achieves exact logarithmic regret when 
some bounds pertaining to the stationary state distribution and 
the corresponding rewards are known. Compared with prior work, 
this algorithm requires the least information. We have also 
presented numerical results and analysis showing that CEE 
significantly outperforms both of the previously proposed 
algorithms for this problem, RCA [14] and RUCB [6], in 
terms of regret and convergence speed, and RCA in terms of 
reward variance. 